12 research outputs found

    Towards the transformation of esophageal speech into laryngeal voice using voice conversion techniques

    This work concerns the development of a conversion system for esophageal voice, with the aim of making it more intelligible. Voice conversion is a technique for transforming the speech signal of a source speaker so that, to a listener, it seems to have been spoken by a target speaker. Given the specific nature of esophageal voice, we propose in this study to apply a new voice conversion technique that takes into account the particularities of the vocal apparatus of patients who have undergone removal of the larynx. Removing the vocal cords profoundly disrupts the glottal signal, and as a result the esophageal voice acquired by the laryngectomized patient is hard to understand, hoarse, and low in intensity. Several voice conversion techniques have been proposed in the literature, among them linear predictive coding for voice conversion [1] and multivariate linear regression [2], which aims to reduce discontinuity and spectral distortion.

    Enhancement of esophageal speech using voice conversion techniques

    This paper presents a novel approach for enhancing esophageal speech using voice conversion techniques. Esophageal speech (ES) is an alternative voice that allows a patient with no vocal cords to produce sounds after total laryngectomy; this voice has poor intelligibility and poor quality. To address this issue, we propose a speaking-aid system that enhances ES to make it clearer and more natural. Given the specificity of ES, we apply a new voice conversion technique that takes into account the particularities of the pathological vocal apparatus. We trained deep neural networks (DNNs) and Gaussian mixture models (GMMs) to predict "laryngeal" vocal tract features from esophageal speech. The converted vectors are then used to estimate the excitation cepstral coefficients and phase by a search in the target training space, previously encoded as a binary tree. The resynthesized voice sounds like a laryngeal voice, i.e., it is more natural than the original ES, with an effective reconstruction of the prosodic information while retaining the vocal tract characteristics inherent to the source speaker, which is the highlight of our study. Objective and subjective evaluations of the converted voice validate the proposed approach.

    A new predictive methodology based on a sequence-to-sequence model for transforming esophageal speech into laryngeal voice

    As the health situation did not allow the 9th Journées de Phonétique Clinique to be held under the best conditions (that is, in person), the programme committee decided to cancel the 2021 edition and to hold instead a one-day event, on 27 May 2021, dedicated to presenting the accepted contributions.

    Enhancement of esophageal speech by a voice conversion technique based on the estimation of multiple deep neural networks and spectral conversion functions

    As the health situation did not allow the 9th Journées de Phonétique Clinique to be held under the best conditions (that is, in person), the programme committee decided to cancel the 2021 edition and to hold a one-day event, on 27 May 2021, dedicated to presenting the accepted contributions.

    Voice conversion: approaches and applications

    Voice conversion (VC) is an important problem in the field of audio signal processing. The goal of voice conversion is to transform the speech signal of a source speaker such that it sounds as if it had been uttered by a target speaker while preserving the linguistic content of the original signal. Gaussian mixture model (GMM) based conversion is the most commonly used technique in VC, but is often sensitive to overfitting and oversmoothing. To address these issues, we propose a secondary classification by applying a K-means classification in each class obtained by a primary classification in order to obtain more precise local conversion functions. This proposal avoids the need for complex training algorithms because the estimated local mapping functions are determined at the same time. The second contribution of this thesis is a new methodology for designing the relationship between two sets of spectral envelopes. Our systems operate by: 1) cascading deep neural networks with Gaussian mixture models to construct DNN-GMM and GMM-DNN-GMM models in order to find an efficient global mapping relationship between the cepstral vectors of the two speakers; 2) using a new spectral synthesis process with cascaded cepstrum predictors and with excitation and phase extracted from the target training space encoded as a KD-tree. Experimental results of the proposed methods show a marked improvement in intelligibility, quality and naturalness of the converted speech signals when compared with those obtained by a baseline conversion method. Extracting excitation and phase from the target training space preserves the target speaker's identity. The last contribution of this thesis is the implementation of a novel speaking-aid system for enhancing esophageal speech (ES). The method aims to improve the quality of esophageal speech by combining a voice conversion technique with a time dilation algorithm. In the proposed system, a deep neural network (DNN) is used as a nonlinear mapping function for vocal tract vector conversion. The converted frames are then used to determine realistic excitation and phase vectors from the target training space, previously encoded as a binary tree, using a frame selection algorithm. We demonstrate that the proposed method provides considerable improvement in intelligibility and naturalness of the converted esophageal stimuli.
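For readers unfamiliar with the GMM-based conversion that the abstract above takes as its starting point, the conversion function in its commonly used joint-density form can be written as below. This is a sketch of the standard formulation, not necessarily the exact variant used in the thesis; the symbols ($M$ components, weights $\alpha_i$, source and target means $\mu_i^{x}$, $\mu_i^{y}$, covariance blocks $\Sigma_i^{xx}$, $\Sigma_i^{yx}$) are notation assumed here.

```latex
F(x) \;=\; \sum_{i=1}^{M} p(i \mid x)\,
      \Bigl[\, \mu_i^{y} + \Sigma_i^{yx} \bigl(\Sigma_i^{xx}\bigr)^{-1} \bigl(x - \mu_i^{x}\bigr) \Bigr],
\qquad
p(i \mid x) \;=\; \frac{\alpha_i \, \mathcal{N}\!\bigl(x;\, \mu_i^{x}, \Sigma_i^{xx}\bigr)}
                       {\sum_{j=1}^{M} \alpha_j \, \mathcal{N}\!\bigl(x;\, \mu_j^{x}, \Sigma_j^{xx}\bigr)}
```

Because the output is a posterior-weighted average of $M$ local linear predictions, fine spectral detail tends to be averaged away, which is one way to see where the oversmoothing mentioned in the abstract comes from.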

    Improving the computational performance of standard GMM-based voice conversion systems used in real-time applications

    Voice conversion (VC) can be described as finding a mapping function that transforms the features extracted from a source speaker into those of a target speaker. Gaussian mixture model (GMM) based conversion is the most commonly used technique in VC, but is often sensitive to overfitting and oversmoothing. To address these issues, we propose a secondary classification by applying a K-means classification in each class obtained by a primary classification in order to obtain more precise local conversion functions. This proposal avoids the need for complex training algorithms because the local mapping functions are determined at the same time. The proposed approach consists of a Fourier cepstral analysis, followed by a training phase that finds the local mapping functions transforming the vocal tract characteristics of the source speaker into those of the target speaker. In the synthesis step, the converted parameters are combined with excitation and phase extracted from the target training space by frame index selection to generate converted speech with the target speaker's characteristics. Objective and subjective experiments show that the proposed technique outperforms the baseline GMM approach while greatly reducing the training and transformation computation times.
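The two-level classification idea described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: frames are scalars rather than cepstral vectors, the primary classification is itself done with K-means (the abstract does not say which classifier is used), and each local conversion function is an affine least-squares fit. The function names are hypothetical.

```python
from statistics import mean

def nearest(centers, v):
    """Index of the center closest to scalar v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))

def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's K-means on scalars with a deterministic quantile init."""
    srt = sorted(values)
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(centers, v)].append(v)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return centers

def fit_affine(pairs):
    """Least-squares fit y = a*x + b; identity map if there is no data."""
    if not pairs:
        return 1.0, 0.0
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    var = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in pairs) / var if var > 0 else 0.0
    return a, my - a * mx

def train(pairs, k1=2, k2=2):
    """Primary clustering of source frames, then K-means inside each class.

    ASSUMPTION: K-means is used for the primary step as well.
    """
    primary = kmeans_1d([x for x, _ in pairs], k1)
    model = []
    for c in range(k1):
        members = [(x, y) for x, y in pairs if nearest(primary, x) == c]
        k_eff = min(k2, len(members))
        if k_eff == 0:
            model.append(([0.0], [(1.0, 0.0)]))
            continue
        sub = kmeans_1d([x for x, _ in members], k_eff)
        maps = [fit_affine([(x, y) for x, y in members if nearest(sub, x) == s])
                for s in range(k_eff)]
        model.append((sub, maps))
    return primary, model

def convert(x, primary, model):
    """Route x through both classification levels, apply the local map."""
    sub, maps = model[nearest(primary, x)]
    a, b = maps[nearest(sub, x)]
    return a * x + b

# Toy aligned (source, target) frames: four regions, each with its own map.
true_maps = {-10.0: (2.0, 1.0), -5.0: (-1.0, 0.0), 5.0: (0.5, 4.0), 10.0: (3.0, -2.0)}
pairs = [(g + d, a * (g + d) + b)
         for g, (a, b) in true_maps.items()
         for d in (-1.0, -0.5, 0.0, 0.5, 1.0)]
primary, model = train(pairs)
converted = convert(-10.25, primary, model)
```

In this toy setting each local map is fitted independently once the clusters are known, which mirrors the abstract's point that no iterative joint training of the mapping functions is needed.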

    Enhancement of esophageal speech using statistical and neuromimetic voice conversion techniques

    This paper presents a novel approach for enhancing esophageal speech using voice conversion techniques. Esophageal speech (ES) is an alternative voice that allows a patient with no vocal cords to produce sounds after total laryngectomy. Although it needs no external device, this voice sounds unnatural compared with laryngeal speech. ES is frequently described as harsh speech with low pitch and low loudness, and consequently it has poor intelligibility and poor quality. To improve the naturalness and intelligibility of esophageal speech, we propose a speaking-aid system that enhances ES to make it clearer and more natural. Given the specificity of ES, we apply a new voice conversion technique that takes into account the particularities of the pathological vocal apparatus. The vocal tract and excitation cepstral coefficients are estimated separately. We trained deep neural networks (DNNs) and Gaussian mixture models (GMMs) to predict "laryngeal" vocal tract features from esophageal speech. The converted cepstral vectors are then used to estimate excitation and phase coefficients by a search in the target training space, previously encoded as a binary tree. The resynthesized voice sounds like a laryngeal voice, i.e., it is more natural than the original ES, with an effective reconstruction of the prosodic information while retaining the vocal tract characteristics inherent to the source speaker, which is the highlight of our study. Objective and subjective evaluations of the converted voice validate the proposed approach.

    A novel voice conversion approach using cascaded powerful cepstrum predictors with excitation and phase extracted from the target training space encoded as a KD-tree

    Voice conversion is an important problem in audio signal processing. The goal of voice conversion is to transform the speech signal of a source speaker such that it sounds as if it had been uttered by a target speaker. Our contribution in this paper is a new methodology for designing the relationship between two sets of spectral envelopes. Our systems operate by: (1) cascading deep neural networks and Gaussian mixture models to construct DNN-GMM and GMM-DNN-GMM models in order to find a global mapping relationship between the cepstral vectors of the two speakers; (2) using a new spectral synthesis process with cascaded cepstrum predictors and with excitation and phase extracted from the target training space encoded as a KD-tree. Experimental results of the proposed methods show a marked improvement in the intelligibility, quality and naturalness of the converted speech signals when compared with stimuli obtained by baseline conversion methods. Extracting excitation and phase from the target training space preserves the target speaker's identity.
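The KD-tree lookup mentioned in step (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: target cepstral vectors are tiny 2-D tuples, the stored excitation and phase entries are placeholder strings, and `build`, `nearest` and `sqdist` are hypothetical names. The idea shown is only that a converted vector selects the index of its closest target training frame, whose excitation and phase are then reused.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def build(points, depth=0):
    """Build a KD-tree from a list of (vector, frame_index) pairs.

    Each node is (point, split_axis, left_subtree, right_subtree).
    """
    if not points:
        return None
    axis = depth % len(points[0][0])
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, query, best=None):
    """Return (squared_distance, frame_index) of the closest stored vector."""
    if node is None:
        return best
    (vec, idx), axis, left, right = node
    d = sqdist(vec, query)
    if best is None or d < best[0]:
        best = (d, idx)
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best)
    # Only visit the far side if the splitting plane is closer than the best hit.
    if diff ** 2 < best[0]:
        best = nearest(far, query, best)
    return best

# Toy target training space (all values are illustrative placeholders).
target_cepstra = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
target_excitation = ["exc0", "exc1", "exc2", "exc3", "exc4"]
target_phase = ["ph0", "ph1", "ph2", "ph3", "ph4"]

tree = build([(vec, i) for i, vec in enumerate(target_cepstra)])

converted_frame = (0.9, 0.2)          # output of some conversion model
_, frame_index = nearest(tree, converted_frame)
excitation, phase = target_excitation[frame_index], target_phase[frame_index]
```

Since the excitation and phase are taken verbatim from real target frames rather than predicted, this kind of selection is consistent with the abstract's claim that the target speaker's identity is preserved.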

    Enhancement of esophageal speech obtained by a voice conversion technique using time dilated Fourier cepstra

    This paper presents a novel speaking-aid system for enhancing esophageal speech (ES). The method aims to improve the quality of esophageal speech by combining a voice conversion technique with a time dilation algorithm. In the proposed system, a deep neural network (DNN) is used as a nonlinear mapping function for vocal tract vector transformation. The converted frames are then used to determine realistic excitation and phase vectors from the target training space using a frame selection algorithm. Next, in order to preserve the identity of the esophageal speaker, we keep the source vocal tract features and apply a time dilation algorithm to them to reduce the unpleasant esophageal noises. Finally, the converted speech is reconstructed from the dilated source vocal tract frames and the predicted excitation and phase. Voice conversion systems based on deep neural networks (DNNs) and Gaussian mixture models (GMMs) were evaluated using objective and subjective measures, in order to assess the changes in speech quality and intelligibility of the transformed signals. Experimental results demonstrate that the proposed methods provide considerable improvement in the intelligibility and naturalness of the converted esophageal speech.
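The abstract does not say how the time dilation is carried out. One plausible reading, sketched here purely as an assumption, is that the sequence of source vocal tract frames is stretched along the time axis by linear interpolation between neighbouring frames. The function name `dilate` and the `factor` parameter are hypothetical.

```python
def dilate(frames, factor):
    """Stretch a sequence of equal-length frames along the time axis.

    ASSUMPTION: dilation is modelled as linear interpolation between
    neighbouring frames; the paper's actual algorithm may differ.
    Returns round(len(frames) * factor) frames (at least one).
    """
    n_in = len(frames)
    n_out = max(1, round(n_in * factor))
    if n_in == 1 or n_out == 1:
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for j in range(n_out):
        pos = j * (n_in - 1) / (n_out - 1)   # fractional position in the input
        i = min(int(pos), n_in - 2)          # left neighbour index
        w = pos - i                          # weight of the right neighbour
        out.append([(1 - w) * a + w * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out

# Three toy 2-coefficient frames stretched to twice their duration.
source_frames = [[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]]
dilated = dilate(source_frames, 2.0)
```

The first and last frames are kept unchanged and the intermediate frames vary smoothly, so the spectral trajectory is slowed down without introducing new discontinuities.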